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Abstract 

Directed acyclic graphs (DAGs) are commonly used to represent causal relationships among random variables in 
graphical models. Applications of these models arise in the study of physical, as well as biological systems, where di- 
rected edges between nodes represent the influence of components of the system on each other. The general problem 
of estimating DAGs from observed data is computationally NP-hard, Moreover two directed graphs may be observa- 
tionally equivalent. When the nodes exhibit a natural ordering, the problem of estimating directed graphs reduces to 
the problem of estimating the structure of the network. In this paper, we propose a penalized likelihood approach 
that directly estimates the adjacency matrix of DAGs. Both lasso and adaptive lasso penalties are considered and an 
efficient algorithm is proposed for estimation of high dimensional DAGs. We study variable selection consistency 
of the two penalties when the number of variables grows to infinity with the sample size. We show that although 
lasso can only consistently estimate the true network under stringent assumptions, adaptive lasso achieves this task 
under mild regularity conditions. The performance of the proposed methods are compared to alternative methods in 
simulated, as well as real, data examples. 

1 Introduction 

Graphical models provide efficient tools for the study of statistical models through a compact representation of the joint 
probability distribution of the underlying random variables. The nodes of the graph represent the random variables, 
while the edges capture the relationships among them. Both directed and undirected edges are used to represent 
interactions among random variables. However, there is a conceptual difference between these two types of edges: 
while undirected edges are used to represent similarity or correlation, directed edges are usually interpreted as causal 
relationships. The study of directed edges is therefore directly related to the theory of causality, and of main interest 
in many applications. A special class of directed graphical models (also known as Bayesian Networks) are based on 
directed acyclic graphs (DAGs), where all the edges of the graph are directed and there are no directed cycles present in 
the graph. DAGs are used in graphical models and belief networks and have been the focus of research in the computer 
science literature (see Pearl (2000)). Important applications involving DAGs also arise in the study of biological 
systems, as many cellular mechanisms are known to include causal relationships. Cell signalling pathways and gene 
regulatory networks are two examples, where causal relationships play an important role (Markowetz and Spang, 
2007). 

The problem of estimating DAGs is an NP-hard problem, and estimation of direction of edges may not be possible 
due to observational equivalence (see section 2). Most of the earlier methods for estimating DAGs correspond to greedy 
search algorithms that search through the space of possible DAGs. A number of methods are available for estimating the 
structure of DAGs for small to moderate number of nodes. The max-min hill climbing algorithm (Tsamardinos et al., 
2006), and the PC-Algorithm (Spirtes et al., 2000) are two such examples. However, the space of possible DAGs grows 
super-exponentially with the number of variables (nodes), and estimation of DAGs using these methods, especially in a 
small n, large p setting, becomes impractical. Bayesian methods of estimating DAGs (e.g Heckerman et al., 1995) are 
also computationally very intensive and therefore not particularly appropriate for large graphs. Kalisch and Buhlmann 
(2007) recently proposed an implementation of the PC-Algorithm with polynomial complexity that can be used for 
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estimation of high dimensional sparse DAGs. However, when the variables inherit a natural ordering, estimation 
of a DAG is reduced to estimating its structure or skeleton. Applications with natural ordering of variables include 
estimation of causal relationships from temporal observations, or settings where additional experimental data can 
determine the ordering of variables, and estimation of transcriptional regulatory networks from gene expression data. 
Examples of these applications are presented in section 6. 

The structure of the graph can be determined from conditional independence relations among random variables. 
For undirected graphs, this is equivalent to learning the structure of the conditional independence graph (CIG), which 
in the case of Gaussian random variables, is determined by zeros in the inverse covariance matrix (also known as 
precision or concentration matrix). Different penalization methods, have been recently proposed to obtain sparse 
estimates of the concentration matrix. Meinshausen and Biihlmann (2006) considered an approximation to the problem 
of sparse inverse covariance estimation using the lasso penalty. They showed under a set of assumptions, that their 
proposed method correctly determines the neighborhood of each node. Banerjee et al. (2008) and Friedman et al. 
(2008b) explored different aspects of the problem of estimating the concentration matrix using the lasso penalty, while 
Yuan and Lin (2007) and Fan et al. (2007) considered other choices for the penalty. Rothman et al. (2008) proved 
consistency in Frobenius norm, as well as in matrix norm, of the l\ -regularized estimate of the concentration matrix 
when p ;§> n, while Lam and Fan (2008) extended their result and considered estimation of matrices related to the 
precision matrix, including the Cholesky factor of the inverse covariance matrix, using general penalties. Penalization 
of the Cholesky factor of the inverse covariance matrix has been also considered by Huang et al. (2006), where they 
used the lasso penalty in order to obtain a sparse estimate of the inverse covariance matrix. This method is based on 
the regression interpretation of the Cholesky factorization model and therefore requires the variables to be ordered a 
priori. 

In this paper, we consider the problem of estimating the skeleton of DAGs, where the variables exhibit a natural 
ordering. We use graph theoretic properties of DAGs and reformulate the likelihood as a function of the adjacency 
matrix of the graph. We then exploit the ordering of variables to propose an efficient algorithm for estimation of 
structure of DAGs, which offers considerable improvement in terms of computational complexity. Both lasso and 
adaptive lasso penalties are considered and variable selection consistency of estimators is established in the p 3> n 
setting. In particular, we show that although lasso is only variable selection consistent under stringent conditions, 
adaptive lasso can consistently estimate the true DAG under the usual regularity assumptions. We also present a data 
dependent choice of the tuning parameter that controls the probability of errors. Theoretical as well as empirical 
evidence shows that when the underlying causal mechanism in the network is linear, the proposed method can also be 
applied to non-Gaussian observations. Finally, additional simulations indicate that although the proposed method is 
derived based on the ordering of variables, the method is not sensitive to random permutations of the order of variables 
in high dimensional sparse settings. 

2 Representation of Directed Acyclic Graphs 

Consider a graph Q = (V, E), where V corresponds to the set of nodes with p elements and E C V x V to the 
edge set. The nodes of the graph represent random variables X\,..., X p and the edges capture associations amongst 
them. An edge is called directed if G E =>■ (j, i) £ E and undirected if G E =^> (J, i) G E, The main 
focus of this paper is a special class of graphs where E consists of only directed edges, and does not include directed 
cycles. We denote by pai the set of parents of node i and for j G pai, we denote j — > i. The skeleton of a DAG is 
the undirected graph that is obtained by replacing directed edges in E with undirected ones. Finally, throughout this 
paper, we represent E using the adjacency matrix A of the graph; i.e. a p x p matrix whose (j, z)th entry indicates 
whether there is an edge (and possibly its weight) between nodes j and i. 

The estimation of DAGs is a challenging problem due to the so-called observational equivalence of DAGs with 
respect to the same probability distribution. More specifically, regardless of the sample size, it may not be possible to 
infer the direction of causation among random variables from observational data. As an illustration of the observational 
equivalence, consider the simple DAG in the right panel of Figure 1 . Reversing the direction of all edges of the graph 
results in a new DAG, which is the same as the original graph, except for changes in the node labels and is therefore 
polymorphic to the original one. It is therefore natural to estimate the equivalence class of DAGs corresponding to 
the same probability distribution V starting with the skeleton of the network. The second challenge in estimating 
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Figure 1 : Left: A simple DAG, Right: Illustration of observational equivalence in DAGs 

DAGs is that conditional independence among random variables may not reveal the skeleton. The notion of conditional 
independence in DAGs is either represented using the concept of d-separation (Pearl, 2000) or the moral graph of 
a DAG (Lauritzen, 1996). The moral graph is obtained by removing the directions of the graph and "marrying" the 
parents of each node. Therefore estimation of the conditional independence structure reveals the structure of the moral 
graph of the DAG, which includes additional edges between parents of each node. This can also be illustrated using the 
simple graph in the left panel of Figure 1. Suppose X i: i = 1, . . . , 4 are normally distributed with covariance matrix 
S. The only zero elements of the inverse covariance matrix are E^ 1 = E^ 1 , as Xi and A3 are connected in the moral 
graph of Q. 

2.1 The Latent Variable Model 

The causal effect of random variables in a directed acyclic graph can be explained using structural equation models, 
where each variable is modeled as a (nonlinear) function of its parents. The general form of these models is given by 
(Pearl, 2000): 

X% = fi(pai,Zi), i = l,...,p (2.1) 

The random variables Zi are the latent variables representing the unexplained variation in each node. To model the 
association among nodes of a DAG, we consider a simplification of (2. 1) with fa being linear. More specifically, let pij 
represent the effect of node j on i for j £ pau then 

Xj = pijXj+Zi, i=l,...,p (2.2) 

In the special case where the random variables are Gaussian, equations (2.1) and (2.2) are equivalent in the sense 
that are coefficients of the linear regression model of Xi's on Xj,j £ pai. It is known in the normal case that 
Pij = 0, j £ pai. 

Consider the simple DAG in the left panel of Figure 1; denoting the influence matrix of the graph by A, (2.2) can 
be written in compact form as X — AZ, where for the simple example above, we have 

/ 1 

A = pia 10 

V P12P23 P23 1 

Let the latent variables Z% be independent with mean /Ltj and variance of. Then E(A) = A/i and £ = var(A) = 

ADA 1 , where D = diag (of ) and A T denotes the transpose of the matrix A. 

The following result from Shojaie and Michailidis (2009a) establishes relationships between the influence matrix 
A, and the adjacency matrix of the graph, A. The second part of the lemma establishes a compact relationship between 
A and A in the case of DAGs, which is explored in section 3 to directly formulate the problem of estimating the skeleton 
of a DAG. 

Lemma 2.1. For any graph Q = (V, A), 

(i) A = A + A 1 + A 2 + ■ ■ ■ = J2T=o Ar > where A ° = L 

(ii) IfQ is a DAG, A has full rank and A = (I — A)^ 1 . 
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Remark 2.2. Part (ii) of Lemma 2.1 and the fact that E = ADA 1 imply that for any DAG, if Da > for all i, 
then E is full rank. More specifically, let <f)j(M) denote the jth eigenvalue of matrix M. Then, ^ m ; n (E) > (or 
<^max(E -1 ) < oo). Similarly, since E _1 = A~ T D~ 1 A~ 1 , full rankness of A implies that 0mi n (E _:L ) > (or 
equivalently 4> ma , x (T,) < oo). This result also applies to all subnetworks of a DAG. 

The properties of the proposed latent variable model established in Lemma 2. 1 are independent of the choice of 
probability distribution V. In fact, since the latent variables Zi in (2.2) are assumed independent, given the entries 
of the adjacency matrix, the distribution of each random variable Xi in the graph only depends on the values of pai. 
Therefore, regardless of the choice of the probability distribution, the joint distribution of the random variables is 
compatible with Q (see for example Pearl (2000) p. 16). In section 5, we illustrate this result using data generated 
according to non-Gaussian distributions. 

3 Penalized Estimation of dags 
3.1 Problem Formulation 

Consider the latent variable model of section 2. 1 and denote by X the n x p data matrix. We assume, without loss 
of generality, that the Xi's are centered and scaled, so that \ii = and of = 1, i = 1, . . . ,p. Note that the results in 
section 2. 1 were established independent of the choice of the probability distribution. As mentioned before, under the 
normality assumption, the latent variable model is equivalent to the general structural equation model. Although we 
focus on Gaussian random variables in the remainder of this paper, the estimation procedure proposed in this section 
can be applied to a variety of other distributions, if one is willing to assume the linear structure in (2.2). 

Denote by = the precision matrix of a p- vector of Gaussian random variables and consider a general 
penalty function by J(f2). The penalized likelihood function is then given by 

fl = argmin{- logdct (fl) + tr (CIS) + \J(tt)} (3.1) 
nyo 

where S = n~ 1 X 1 X denotes the empirical covariance matrix and A is the tuning parameter controlling the size of the 
penalty. Applications in biological and social networks often involve sparse networks. It is therefore desirable to find a 
sparse solution for (3.1). This becomes more important in the small n, large p setting, where the unpenalized solution 
includes many additional edges. The lasso penalty of Tibshirani (1996) and the adaptive lasso penalty proposed by Zou 
(2006) are singular at zero and therefore result in sparse solutions. We consider these two penalties in order to find a 
sparse estimate of the adjacency matrix. Other choices of the penalty function are briefly discussed in the conclusions 
section. 

Using the latent variable model of section 2. 1 , and the relationship between the covariance matrix and the adjacency 
matrix of DAGs established in Lemma 2.1, the problem of estimating the adjacency matrix of the graph can be directly 
formulated as an optimization problem based on A. As noted in Lemma 2. 1, if the underlying graph is a DAG and the 
ordering of the variables is known, then A is a lower triangular matrix with zeros on the diagonal. Let A — {A : Aij = 
0, j > i}- Then using the facts that dct(A) = 1 and of = 1, A can be estimated as the solution of the following 
optimization problem 

A = argmin {tr [{I - A) 1 {I - A)S] + XJ(A)} (3.2) 
AeA 

In this paper, we consider the general weighted lasso problem, where 

J(A) = X w *Mv\ (3-3) 

i,j=l'P,3<i 

Lasso and adaptive lasso problems are special cases of this general penalty. In the case of lasso, my = 1. The original 
weights in adaptive lasso, proposed by Zou (2006) are obtained by setting my = \Ai 3 ■ |~ 7 , for some initial estimate of 
the adjacency matrix A and some power 7. To facilitate the study of asymptotic properties of adaptive lasso estimates 
we consider the following modification of the original weights 

my = lV|iyp (3.4) 
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where the original estimates A are obtained from the regular lasso estimates. 

The objective function for both lasso and adaptive lasso problems is convex. However, since the l\ penalty is 
non-differentiable, these problems can be reformulated using matrices A + = max(A, 0) and A_ = — min(A, 0). To 
that end, let W be the pxp matrix of weights for adaptive lasso, or the matrix of ones for the lasso estimation problem. 
Problem (3.2) can then be formulated as: 

min tr{S(I-A+ + A^) 1 (I - A+ + AJ) + X(A + + A_)W + A(A + + (3.5) 

where ^ is interpreted componentwise, A is a large positive number and lj+ is the indicator matrix for lower 
triangular elements of a p x p matrix, including the diagonal elements. The last term of the objective function 
(tr[A(A + + prevents the upper triangular elements of the matrices A + and A- to be nonzero. 

Problem (3.5) is a quadratic optimization problem with non-negativity constraints and can be solved using stan- 
dard interior point algorithms. However, such algorithms do not scale well with dimension and are only applicable 
if p ranges in the hundreds. In section 3.2, we present an alternative formulation of the problem, which leads to 
considerably more efficient algorithms. 



3.2 Optimization Algorithm 

Consider again the problem of estimating the adjacency matrix of DAGs with either lasso or adaptive lasso penalties. 
Denoting the ith row of matrix A as Ai we can write (3.2) as: 

A = argmin J V A t T SA t - 2AA 1 + 1 (3.6) 



AeA 



. 1=1 



It can be seen that the objective function in (3.6) is separable and therefore it suffices to solve the optimization problem 
over each row of matrix A. Denote by I the set of indices up to I, i.e. l=j:l<j<l. 

Then, taking advantage of the lower triangular structure of A, solving (3.6) is equivalent to solving the following 
p— 1 optimization problems (An = 0) 

Ai^x = argmin J (TSi-i^-xB - 2S M _i0 + A V \6A Wi j 1, i = 2,...,p (3.7) 

* eR< -' { i=i J 

Using the facts that Sj-i^-i = n^ 1 (X^i-i) 1 Xn,i-\ an d = n^ 1 (Xn.%) 1 <%n,i-i> me problem in (3.7) can be 

reformulated as following l\ -regularized least squares problems 

= argmin J n^X^i-iB - Xn,i\\l + *i Y\ \> i = 2,...,p (3.8) 

The formulation in (3.8) indicates that the ith row of matrix A includes the coefficient of projecting Xi on Xj, j = 
1, 1, which is in agreement with the discussion in section 2.1. It also reveals a connection between covariance 
selection methods and the neighborhood selection approach of Meinshausen and Biihlmann (2006); namely, when the 
underlying graph is a DAG, the approximate solution of the neighborhood selection problem is exact, if the regression 
model is fitted on the set of parents of each node instead of all other nodes in the graph. 

Using (3.8), the problem of estimating DAGs can be solved very efficiently. In fact, it suffices to solve p — 1 
lasso problems for estimation of least squares coefficients, with dimensions ranging from 1 to p — 1. To solve these 
problems, we use the efficient pathwise coordinate optimization algorithm of Friedman et al. (2008a), implemented in 
the R-package glmnet. The proposed algorithm is summarized in Algorithm 1. 



3.3 Analysis of Computational Complexity 

In this section, we provide a comparison of the computational complexity of the algorithm proposed in section 3.2 
and the PC-Algorithm. As mentioned in the introduction, the space of all possible DAGs is super-exponential in the 
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Figure 2: CPU time for analysis of simulated DAGs with different number of nodes and sample size. The results are 
average times of 10 repetitions on an Intel 2 ■ OGH processor with 2 • 0GB of RAM. 

number of nodes and hence it is not surprising that the PC-Algorithm, without any restriction on the space of DAGs 
has exponential complexity. Kalisch and Biihlmann (2007) propose an efficient implementation of the PC-Algorithm 
for sparse DAGs; its complexity where the maximal neighborhood size q is small, is bounded with high probability by 
0(p q ). Although this is a considerable improvement over other methods of estimating DAGs, in many applications it 
can become fairly expensive. For example, gene regulatory networks and signaling pathways exhibit a "hub" structure, 
which leads to large values for q. The graphical lasso algorithm, proposed by Friedman et al. (2008b), uses an iterative 
algorithm for estimation of the inverse covariance matrix and has computational complexity 0(p 3 ). 

The reformulation of the DAG estimation problem in (3.8) requires solving p— 1 lasso regression problems. The cost 
of solving a lasso problem comprised of k covariates and n observations using the pathwise coordinate optimization 
(shooting) algorithm of Friedman et al. (2008a) is 0(nk); hence, the total cost of estimating the adjacency matrix of 
the graph is 0(np 2 ), which is the same to the cost of calculating the (full) empirical covariance matrix S n . Moreover, 
the formulation in (3.8) includes a set of non-overlapping sub-problems. Therefore, for problems with very large 
number of nodes and/or observations, the performance of the algorithm can be further improved by parallelizing the 
estimation of these sub-problems. The adaptive lasso version of the problem is similarly solved using the modification 
of the regular lasso problem proposed in Zou (2006), which results in the same computational cost as the regular lasso 
problem. 

Figure 2, compares the CPU time required for estimation of DAGs using both the PC-Algorithm, as well as our 
proposed algorithm for a range of values of p and n. To control the complexity of the PC algorithm, the average 
neighborhood size is set to 5 and the significance level for the PC-Algorithm, as well as the tuning parameter for lasso 
and adaptive lasso penalties are set according to the optimal values discussed in Section 5. The time reported for the 
PC- Algorithm is the CPU time required for estimation of the skeleton of the graph. The plot demonstrates the higher 
order of complexity of the PC- Algorithm, as well as the dependency of the algorithm on the sample size. 



Algorithm 1 Penalized Likelihood Estimation of DAGs 

1 . Given the ordering O, order the columns of observation matrix X in increasing order 

2. For i = 2, 3, . . . ,p 

2.1. Let y = X R:i , X = Xn,i-i and w = Wjj-i 

2.2. Given the weight matrix W, solve Ai^-i = argmin < n _1 \\X9 — j/Hf + Aj $3}=i l^'l^i 
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4 Asymptotic Properties 



4.1 Preliminaries 

We next establish theoretical properties for both lasso, as well as adaptive lasso estimators of the adjacency matrix 
of DAGs. Asymptotic properties of lasso-type estimates with fixed design matrices have been studied by a number 
of researchers (see e.g. Knight and Fu, 2000; Zou, 2006; Huang et al., 2008). Lasso estimates with random design 
matrices have been also considered by Meinshausen and Biihlmann (2006). On the other hand, Rothman et al. (2008) 
and Lam and Fan (2008) among others have studied asymptotic properties of estimates of covariance and precision 
matrices. 

As discussed in section 3.1, the problem of estimating the adjacency matrix of a DAG is equivalent to solving p— 1 
non-overlapping penalized least square problems described in (3.7). In order to study the asymptotic properties of the 
proposed estimators, we focus on the asymptotic consistency of network estimation, i.e. the probability of correctly 
estimating the network structure, in terms of type I and type II errors. We allow the total number of nodes in the graph 
to grow as an arbitrary polynomial function of the sample size, while assuming that the true underlying network is 
sparse. The assumptions required for establishing asymptotic properties of lasso and adaptive lasso estimates of DAGs 
are presented in section 4.2. We then study variable selection consistency of both lasso and adaptive lasso estimates in 
section 4.3. The choice of penalty parameter is studied in section 4.4. Technical proofs are given in the Appendix. 



4.2 Assumptions 

Let X = (Xi, . . . , X p ) be a collection of p zero-mean Gaussian random variables with covariance matrix E. Denote 
by X the n x p matrix of observations, and by S the empirical covariance matrix. To simplify the notation, denote by 
9 % = A it i_i the entries of the ith row of A to the left of the diagonal. Further, let 6 % ' z be the estimate for the ith row, 
with values outside the set of indices I set to zero; i.e., 9 hX = Ai^i and Ajj = 0, j ^ 2. Consider the following 
assumptions: 

(A-0) For some a > 0, p = p(n) = 0(n a ) as n — > oo, and there exists a < b < 1 such that maxi 6 y card (jpCLj) = 
0(n b ) as n — > oo. 

(A-l) There exists v > such that for all n e N and all i e V, var [X t \ X^A > v. 

(A-2) There exists 5 > and some £ > b (with b defined above) such that for all i € V and for every j G pa,, 
Kijl > <5n~( 1 ~'>)/ 2 , where tt^ is the partial correlation between Xi and Xj after removing the effect of the 
remaining variables. 

(A-3) There exists ^ < oo such that for all n E N and every i EV and for every j € pa,i, \\0 : '' pai \\ < 



(A-4) There exists n < 1 such that for alH S V and for every j £ pat, J^kepa sign(6'^ ^ ' 0l )6' , 



< K. 



Assumption (A-3) limits the magnitude of the effects that each node in the network receives from its parents. This 
is less restrictive than the neighborhood selection criterion, where the effects over all neighboring nodes are assumed 
to be bounded. In fact, empirical data indicate that the average number of upstream-regulators per gene in regulatory 
networks is less than 2 (Leclerc, 2008). Thus, the size of parents of each node is small, but each hub node can affect 
many downstream nodes. 

Assumption (A-4) is referred to as neighborhood stability and is equivalent to the irrepresentability assumption 
proposed by Huang et al. (2008). It has been shown that the lasso estimates are not in general variable selection 
consistent if this assumption is violated. Huang et al. (2008) considered adaptive lasso estimates with general initial 
weights and showed their variable selection consistency under a weaker form of irrepresentability assumption, referred 
to as adaptive irrespresentability . We will show that when the initial weights for adaptive lasso are derived from the 
regular lasso estimates (as in (3.4)), the assumption of neighborhood stability, as well as the less stringent assumption 
(A-3) are not required for establishing variable selection consistency of adaptive lasso. This relaxation in assumptions 
required for variable selection consistency, is a result of the consistency of regular lasso estimates, as well as the special 
structure of DAGs. However, the results of this section can be extended to adaptive lasso estimates of the precision 
matrix, as well as regression models with fixed and random design matrices, under additional mild assumptions. 
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4.3 Asymptotic Consistency of dag Estimation 

Our first result studies the variable selection consistency of the lasso penalty. 

Theorem 4.1 (Variable Selection Consistency of Lasso). Suppose that (A-l )-(A-4) hold and A x drr { ~ v ~ c ^l 2 for some 
b < £ < £ and d > 0. Then there exist constants cu\, . . . , cu v \ > Ofor the lasso estimation problem, such that for all 
i G V, as n — > oo 

1. ( i) ] Estimation of the direction of influence: 

pr jsign(6^ pai ) = sign (0 l j pai )j ? or all j G pa;} = 1 - O {cxp (~c (i) n c )} 

(ii) Control of type I error: pr (pa i C pai) = 1 — {exp (— c^^n^)}. 
(hi) Control of type II error: pr (pa,; C pa^) = 1 — {cxp (— C( iii )7i 1 ')}. 
f/vj Let E be the lasso estimate for the set of edges in the network. Then 

py(E = E) = l-0 {exp (-c (iv) n c )} . 

Proof. The proof of this theorem follows from arguments similar to those presented in Meinshausen and Biihlmann 
(2006), with minor modifications and replacing conditional independence in undirected graphs with d-separation in 

DAGS. □ 

The next result establishes similar properties for adaptive lasso estimates, without the assumptions of neighborhood 
stability. The proof of Theorem 4.3 makes use of consistency of a class of sparse estimates of the Cholesky factors 
of covariance matrices, established in Theorem 9 of Lam and Fan (2008). For completeness, we restate a simplified 
version of the theorem for our lasso problem, for which a% = l,i = 1, . . .p and the eigenvalues of the covariance 

matrix are bounded (see Remark 2.2). Throughout this section, we denote by s the total number of nonzero element 
of the true adjacency matrix, A of the DAG. 

Theorem 4.2 (Lam and Fan (2008)). //n _1 (s + l)logp = o(l) and A = O |(logp/n) 1/2 |, then \\A - A\\ F = 
O p [{n- l s\ogp) 1/2 ). 

It can be seen from Theorem 4.2 that lasso estimates are consistent as long as n _1 (s + 1) logp = o(l). To take 
advantage of this result, we replace (A-0) with the following assumption 

1. (A-0')] For some a > 0, p = p(n) = 0{n a ) as n — > oo. Also, maxigy card (pai) = 0(n b ) as n — > oo, where 
sn 2b ~ 1 logn = o(l) as n — > oo. 

Assumption (A-0') further restricts the number of parents of each node and also enforces a restriction on the total 
number of nonzero elements of the adjacency matrix. Condition sn 26-1 logn = o(l), implies thatfe < 1/2. Therefore, 
although the consistency of adaptive lasso in Theorem 4.3 is established without making any further assumptions on 
the structure of the network (compared to Theorem 4. 1), it is achieved at the price of requiring higher degree of sparsity 
in the network. We now state the main result regarding variable selection consistency of adaptive lasso. Note that the 
theorem only requires assumptions (A-0'), (A-l) and (A-2), and assumptions (A-3) and (A-4) are no longer required. 

Theorem 4.3 (Variable Selection Consistency of Adaptive Lasso). Consider the adaptive lasso estimation problem, 
where the initial weights are calculated using regular lasso estimates of the adjacency matrix of the graph in (3.7). 
Suppose (A-0 1 ) and (A-l)-(A-2) hold and A X g?ti - ( 1- ^/ 2 for some b < £ < £ and d > 0. Also suppose that the 
initial lasso estimates are found using a penalty parameter A that satisfies X° = O | (logp/n) 1 ^' 2 1. Then there exist 
constants C(j\, . . . , C(s v \ > such that for all i £ V, as n — > oo (i)-(iv) in Theorem 4.1 hold. 

Proof. A proof is given in the Appendix. □ 
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4.4 Choice of the Tuning Parameter 



Both lasso, as well as adaptive lasso estimates of the adjacency matrix, depend on the choice of the tuning parameter 
A. Different methods have been proposed for selecting the value of tuning parameter, including cross validation 
(Rothman et al., 2008; Fan et al., 2007) and Bayesian Information Criteria (Bic) (Yuan and Lin, 2007). In both of 
these methods, the value of A is chosen to minimize the (penalized) likelihood function. However, choices of A that 
result in the optimal classification error do not guarantee a small error for the network reconstruction. We propose next 
a choice of A for DAGs. Consider the following choice of the tuning parameter for the general weighted lasso problem 
with weights my. Let Z* denote the (1 — q)th quantile of standard normal distribution, and define 



The following theorem, states that such a choice controls the probability of falsely joining two distinct ancestral sets, 



Definition 4.4. For every node i 6 V, the ancestral set of node i, ANi consists of all nodes j such that j is an ancestor 
of i or i is an ancestor of j or i and j have a common ancestor k. 

Theorem 4.5 (Choice of the Tuning Parameter). Under the assumptions of Theorems 4.1 and 4.3 above (for lasso and 
adaptive lasso, respectively), for all n £ N the solution of the general weighted lasso estimation problem with tuning 
parameter determined in (4.1) satisfies 



Note that as in Meinshausen and Biihlmann (2006), Theorem 4.5 is true for all values of p and n. However, the 
theorem does not provide any guarantee on false positive or false negative probabilities for individual edges in the 
graph. We also need to determine the optimal choice of penalty parameter A for the first phase of the adaptive lasso, 
where the weights are estimated using lasso. Since the goal of the first phase is to achieve prediction consistency, 
cross validation can be used to determine the optimal choice of A . On the other hand, it is easy to see that the 
error-based proposal in (4.1) satisfies the requirement of Theorem 4.2 and can therefore be used to define A . It is 
however recommended to use a higher value of significance level in estimating the initial weights, in order to prevent 
an over-sparse solution. 

5 Performance Analysis 
5.1 Preliminaries 

In this section, we consider examples of estimating DAGs of varying number of edges from randomly generated data. 
To randomly generate data from DAGs, one needs to generate lower-triangular adjacency matrices with sparse nonzero 
elements, p^. We use the random DAG generator in the R-package pcalg (Kalisch and Biihlmann (2007)) which also 
controls the neighborhood size. The sparsity levels in DAGs with different sizes are set according to the theoretical 
bounds in section 4, as well as the recommendations of (Kalisch and Biihlmann, 2007), for neighborhood size. More 
specifically, in simulations throughout this section, we use a maximum neighborhood size of 5, while limiting the total 
number of true edges to be equal to the sample size n. 

Different measures of structural difference can be used to evaluate the performance of estimators. The Structural 
Hamming Distance (SHD) between the structures of the estimated and true DAGs, represents the number of edges 
that are not in common between the two graphs and is equal to the sum of false positive and false negative edges in 
the estimated graph. The main drawback of this measure is its dependency on the number of nodes, as well as the 
sparsity of the network. The second measure of goodness of estimation considered here is the Matthews Correlation 
Coefficient (MCC). MCC is commonly used to assess the performance of binary classification methods and is defined 




(4.1) 



defined next. 



pr(3i G V : ANi £ AN t ) < a 



Proof. A proof is given in the Appendix. 



□ 



as 



MCC = 



(TP X TN) - (FP X FN) 



(5.1) 



(TP + FP)(TP + FN)(TN + FP)(TN + FN) 



1/2 
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where TP, TN, FP and FN correspond to true positive, true negative, false positive and false negative, respectively. 
The value of MCC ranges from —1 to 1 with larger values corresponding to better fits (—1 and 1 represent worst and 
best fits, respectively). Finally, in order to compare the performance of different estimation methods with theoretical 
bounds established in section 4.3, we also report the values of false positive rates. 

The performance of both the PC-Algorithm as well as our proposed estimators based on the choice of tuning 
parameter in (4.1) vary with different values of significance level a. In the following experiments, we first investigate 
the appropriate choice of a for each estimator. We then compare the performance of the estimators with an optimal 
choice of lambda using both numerical measures of performance, as well as gray-scale images of the estimated DAGs 
with the true structure. Gray-scale images are obtained by calculating the proportion of times that a specific edge is 
present in the simulations (i.e. Ay ^ 0). To offset the effect of numerical instability, we consider an edge present, if 

\h\ >io- 4 - 



5.2 Estimation of dags from Normally Distributed Observations 

We begin with an example that illustrates the differences between estimation of DAGs and the conditional independence 
networks (CIG). The first two images in Figure 3 represent a randomly generated DAG of size 50 along with the gray- 
scale image of the average precision matrix estimated based on 100 observations using the graphical lasso algorithm 
(implemented in the R package glasso). To control the probability of falsely connecting two components of the 
graph, the value of the tuning parameter for glasso is defined based on the error-based proposal of Banerjee et al. 
(2008) (Theorem 2). It can be seen that the CIG has many more edges (8% false positives compared to 1% for lasso and 
adaptive lasso), and does not reveal the true structure of the underlying DAG. It can be seen that although methods of 
estimating ClGs are computationally efficient, they should not be used in applications like estimation of gene regulatory 
networks, where the underlying graph is directed. In simulations throughout this section, the sample size is fixed at 
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Figure 3: True DAG along with estimates from Gaussian observations using glasso, pcalg, lasso and Alasso. 
Gray scale represents the percentage of inclusion of edges. 

n = 100, and estimators are evaluated for an increasing number of nodes (p = 50, 100, 200). Figure 4 shows the 
mean and standard deviation of Hamming distances for estimates based on the PC- Algorithm (pcalg), as well as 
the proposed lasso (lasso) and adaptive lasso (Alasso) methods for different values of the tuning parameter a and 
different network sizes. For all values of p and a, the adaptive lasso estimate gives the best results, and the proposed 
penalized likelihood methods outperform the PC- Algorithm. This difference becomes more significant as the size of 
the network increases. 

As mentioned in section 2, it is not always possible to estimate the direction of the edges of a DAG and therefore, 
the estimate from the PC-Algorithm may include undirected edges. Since our penalized likelihood methods assume 
knowledge of the ordering of variables and estimate the structure of the network, in the simulations considered here, 
we only estimate the skeleton (structure) of the network using the PC-Algorithm. We then then use the ordering of 
the variables to determine the direction of the edges. The performance of the the PC-Algorithm for estimation of the 
partially completed DAG (PCDAG) may therefore be worse than the results reported here. 

In the simulation results reported here, observations are generated according to the linear structural equation model 
(2.2) with standard normal latent variables and pij = p = • 8. Additional simulation studies with different values 
of a and p indicate that changes in a do not have a considerable affect on the performance of the proposed models. 
On the other hand, as the magnitude of p decreases, the performance of the proposed methods (as well as the pcalg) 
algorithm deteriorates, but the findings of the above comparison remain unchanged. 
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Figure 4: Hamming Distance for estimation of DAG using pcalg, lasso and Alasso from normal observations. 
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Figure 5: MCC, FP and TP for estimation of DAG using pcalg, lasso and Alasso from normal observations. 



The above simulation results suggest that the optimal performance of the PC-Algorithm is achieved when a = 
0.01. The performance of lasso and adaptive lasso methods is less sensitive to the choice of a; however, a = 0.10 
seems to deliver more reliable estimates. Similar results were also observed in simulations with other choices of a 
and p. In addition, our extended simulations indicate that the performance of adaptive lasso does not vary significantly 
with the choice of power (7), and therefore we only present the results for 7 = 1. Figure 3 represents images of 
estimated and true DAGs created based on the above considerations for tuning parameters for p = 50. Similar results 
are observed for p = 100, p = 200, and are excluded to conserve space. Plots in figure 5 compare the performance of 
the three methods with the optimal settings of tuning parameters, over a range of values of p. It can be seen that the 
values of MCC confirm the above findings based on SHD. However, false positive and true positive rates only focus on 
one aspect of estimation at a time and do not provide a clear distinction between the methods. These plots also suggest 
that estimates from the pcalg may vary significantly. This variation is represented more significantly in terms of 
false positive rates, where standard deviations for estimates based are up to 10 times larger than those of lasso and 
Alasso estimates (~ 40% for pcalg compared to < 4% for lasso and Alasso). 

As mentioned in section 2, the representation of conditional independence in DAGs adapted in our proposed algo- 
rithm, is not restricted to normally distributed random variables. Moreover, if the underlying structural equations are 
linear, the method proposed in this paper can correctly estimate the underlying DAG. In order to assess the sensitivity 
of the estimates to the underlying distribution, we performed two simulation studies with non-Normal observations. In 
both simulations, observations were generated according to a linear structural model. However, in the first simulation, 
each latent variable was generated from a mixture of a standard normal and a t-distribution with 3 degrees of freedom, 
while in the second simulation, a t-distribution with 4 degrees of freedom was used. The performance of the proposed 
algorithm for non-normal observations was similar to the case of Gaussian observations, with Alasso providing the 
best estimates, and the performance of penalized methods improving as the dimension and sparsity increase. 

5.3 Sensitivity to Perturbations in the Ordering of the Variables 

Algorithm 3.2 assumes a known ordering of the variables. The superior performance of the proposed penalized likeli- 
hood methods in comparison to the PC-Algorithm may be explained by the fact that additional information about the 
order of the variables significantly simplifies the problem of estimating DAGs. Therefore, when such additional infor- 
mation is available, estimates using the PC-Algorithm suffer from a natural disadvantage. However, as the underlying 
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network becomes more sparse, the network includes fewer complex structures and it is expected that the ordering of 
variables should play a less significant role. 

Next, we study the performance of the proposed penalized likelihood methods as well as the PC-Algorithm in 
problems where the ordering of variables is unknown. To this end, we generate normally distributed observations 
from the latent variable model of section 2.1. We then randomly permute the order of variables in the observation 
matrix and use the permuted matrix to estimate the original DAG. Figure 6 illustrates the performance of the three 
methods for choices of a described in section 5.2. It can be seen that for small, dense networks, the PC-Algorithm 
outperforms the proposed methods. This is expected since the change in the order of variables causes the algorithm 
to include unnecessary moral edges, while failing to recognize some of the existing associations. On the other hand, 
as the size of the network and correspondingly the degree of sparsity in the network increase, the local structures 
become simpler and therefore the ordering of the variables becomes less crucial. Thus, the performance of penalized 
likelihood algorithms is improved compared to that of the PC-Algorithm. For the high dimensional sparse case, 
where the computational cost of the PC- Algorithm becomes more significant, both lasso and adaptive lasso methods 
outperform the PC-Algorithm. 





pcatg 


i 


lasso 


■ + 


AIss30,y = 1 






Figure 6: MCC, FP and TP for estimation of DAG using pcalg, lasso and Alasso with random ordering. 



6 Real Data Application 

6.1 Analysis of Cell Signalling Pathway Data 

Sachs et al. (2003) carried out a set of flow cytometry experiments on signaling networks of human immune system 
cells. The ordering of the connections between pathway components were established based on perturbations in cells 
using molecular interventions and we consider the ordering to be known a priori. The data set includes p = 1 1 proteins 
and n = 7466 samples. 

Friedman et al. (2008b) analyzed this data set using the glasso algorithm. They estimated the graph for a range 
of values of the l\ penalty and reported moderate agreement (around 50% false positive and false negative rates) 
between one of the estimates and the findings of Sachs et al. (2003). True and estimated signaling networks using 
the PC-Algorithm, as well as both lasso and adaptive lasso algorithms, along with performance measures are given in 
Figure 7. The estimated network using the pcalg includes a number of undirected edges. As in the simulation studies, 
we only estimate the structure of the network using the pcalg and determine the direction of edges by enforcing the 
ordering of nodes in the DAG. It can be seen that the adaptive lasso and lasso provides estimates that are closest to the 
true structure. 



6.2 Transcription Regulatory Network of E-coli 

Transcriptional regulatory networks play an important role in controlling the gene expression in cells and incorpo- 
rating the underlying regulatory network results in more efficient estimation and inference (Shojaie and Michailidis, 
2009a,b). Kao et al. (2004) proposed to use Network Component Analysis to infer transcriptional regulatory net- 
work of Escherichia coli (E-coli). They also provide whole genome expression data over time (n = 24), as well as 
information about the known regulatory network of E-coli. 
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Figure 7: Known and estimated networks for human cell signalling data. True edges (True Positives in estimated 
networks) are marked with solid blue arrows, while False Positives are indicated by dashed red arrows. 



Figure 8: Known and estimated transcription regulatory network of E-coli. Large nodes indicate the transcription fac- 
tors (TFs) and smaller nodes refer to their regulated genes (ORFs). True edges (True Positives in estimated networks) 
are marked with solid blue arrows, while False Positives are indicated by dashed red arrows. 

In this application, the set of transcription factors (TFs) are known a priori and the goal is to find connections 
among transcription factors and regulated genes through analysis of whole genome transcriptomic data. Therefore, 
the algorithm proposed in this paper can be used by exploiting the natural hierarchy of TFs and genes. Kao et al. 
(2004) provide gene expression data for 7 transcription factors and 40 genes regulated by these TFs. Figure 8 presents 
the known regulatory network of E-coli along with the networks estimated using three different methods as well as 
different measures of performance. The relatively poor performance of the algorithms in this example can be partially 
attributed to the small sample size. However, it is also known that no single source of transcriptomic data is expected 
to successfully reveal the regulatory networks and methods that combine different sources of data are considered to 
be more efficient. It can be seen that the PC-Algorithm can only detect one of the true regulatory connections, while 
the proposed algorithm, with both lasso, as well as adaptive lasso penalties, offers significant improvements over the 
results of the PC-Algorithm, mostly due to the significant drop in proportion of false negatives (from 97% for the 
PC-Algorithm to 63% for adaptive lasso). Although the estimates based on lasso and adaptive lasso penalties are very 
similar, the choice of the best estimate depends on the performance evaluation metric. 

7 Conclusion 

We proposed efficient penalized likelihood methods for estimation of the structure of DAGs when variables inherit 
a natural ordering. Both lasso and adaptive lasso penalties were considered in this paper. However, the proposed 
algorithm can also be used for estimation of adjacency matrix of DAGs under other choices of penalty, as long as the 
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penalty function is applied to individual elements of the adjacency matrix. 

There are a number of biological applications where the ordering of the variables is known a priori. Estimation of 
transcriptional regulatory networks from gene expression data and reconstruction of causal networks from temporal 
observations, based on the concept of Granger causality (Granger, 1969) are areas of potential application for the 
proposed algorithm. Our simulation studies also indicate that the correct ordering of variables is less crucial for 
estimation high dimensional sparse DAGs. Thus, even in high dimensional sparse applications where the ordering 
among variables is not known, the methods proposed in this paper may be an efficient alternative for search based 
methods of estimating DAGs. 



Appendix: Technical Proofs 

Recall from section 4.2 that 9 % = Ai^i denotes the entries of the ith row of the adjacency matrix to the left of the 
diagonal. Also, denote by 9 l ' X the estimate for the ith row with values outside the set of indices X set to zero and let 
be the j component of 9 1 ' 1 . 

The following lemma is a consequence of the KarushKuhnTucker conditions for the general weighted lasso prob- 
lem and is used in the proof of Theorems 4.3 and 4.5. 

Lemma 7.1. Let 9 l,x be the general weighted lasso estimate ofO 1 ' 1 , i.e. 



yi.X 



argmm 



- 1 \\x i -x9\\l + xJ2\9 k \ Wik \ (7.i) 



fc=i 



Let 



Gj(9) = -2n^ 1 X]{X l - X6) 



and Wi be the vector of initial weights in adaptive lasso estimation problem. Then a vector 9 with 9k = 0, Vfc ^ X is 
a solution of (7.1) iffVj € X,Gj(9) = — sign (6j)wijX if 9j ^ and \Gj(6)\ < WijX if 9j = 0. Moreover, if the 
solution is not unique and \Gj (9) \ < WijX for some solution 9, then9j = for all solutions of (7.1 ). 

Proof. The proof of the lemma is identical to the proof of Lemma (A. 1) in Meinshausen and Buhlmann (2006) (except 
for inclusion of general weights u»y) and is therefore omitted. □ 

of Theorem 4.3. To prove (i), note that by Bonferroni's inequality, and the fact that card (pai) = o(n) as n — > oo, it 
suffices to show that there exists some c^ > such that for all i 6 V and for every j € pai, 

pr |sign (9j Pai ) = sign (#*' pai ) j = 1 — {cxp (— c^n.^)} as n —> oo 

Let 9 l ' pai (f3) be the estimate of 9 l ' pai in (7.1), with the jth component fixed at a constant value [3, 



9iw(P)= aign dn\n- 1 \\X i -X6\\l+\J2\Qk\w k \ (7.2) 
3 " I fe=i J 



?ee a 



where 6 /3 = {9 e W : 9j = [3,9 k = 0,Vfc <£ pa,}. Note that for (3 = 9 l - pa \ 6» 4 ' pa *(/3) is identical to 9 1 ^ . Thus, 
if sign (8 l j Pat ) ^ sign (#]), there would exist some j3 with sign ((3) sign (#]) < such that 9 l,pai ((3) is a solution to 
(7.2). Since 6j ^ 0, Mj £ pai, it suffices to show that for all (3 with sign ((3) sign < 0, with high probability, 
gi,pa,i ^ can not k e a so i u ti on to (7.2). Without loss of generality, we consider the case where 9j > (Sj < can be 
shown similarly). Then if (3 < 0, from Lemma 7.1, 9 l,pai ((3) can be a solution to (7.2) only if Gj(9 l ((3)) > —Xwij. 
Hence, it suffices to show that for some > and all j G pai with 9* > 0, 



pr 



sup 

/3<0 



{G j (0*08)) < —Xwij} 



1 — O {cxp (— C(j)n^)} as n — > oo (7.3) 
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Define, 

For every j G pa\ we can write, 



n i (0) = x i -x0 i (fi) 

k£pa,i\{j} 



(7.4) 
(7.5) 



where Zj is independent of {X k ; k G pcii\{j}}- Then by (7.5), 

Gj (<?'(/?)) = -2n- 1 Zj^(/3) - ^ ^*°*\«>2n- 1 Ai T 72iG9) 

fcepa»\{i} 

By Lemma 7.1, it follows that for all k G pa 4 \{j}, \G k (<9 l (/3)) \ = \2n- 1 X k r K i (P)\ < Xw ik , thus, 

Gj (&(pj) < -2n~ l Z}n0) + A J2 \6 j ' paMj}l 



(7.6) 



k£pa,i\{j} 



Using the fact that |0-?> o AM < 1, it suffices to show that 



pr 



sup{-2n- 1 Z;7e. 1 (/3)} < -A ^ 



/3<0 



l-0{cxp(-c (4) n c )} 



as n — > oo, 



(7.7) 



or equivalently, 



pr 



inf {2n- 1 Z]TZ t {P)} < X Y w lk 

/3<0 ^— ' 



O {exp (— C(i)n^)} as n 



oo. 



(7.8) 



It is shown in Lemma A. 2. of Meinshausen and Buhlmann (2006) that for any q > 0, there exists > such that 
for all j G pai with 0j > 



pr 



inf {27i" 1 Z 1 T 7e j (/?)} < gA 

/3<0 3 ~ 



= O |exp(-C(j)n c )} 



as n — > oo. 



(7.9) 



However, by definition Wi k < 1 and therefore, J2kepa- w ik < card (pa,) < 1, which implies that (i) follows from 7.9. 

To prove (ii), note that the event pa i pat is equivalent to the event that there exists a node j G i — l \pai such 
that B) ^ 0, i.e. 

pr (p\ C pa,) = 1 - pr (sj G 7 - l\pa» : 0} ^ o) (7.10) 
By Lemma 7.1, and using the fact that by definition Wij > 1 



pr (3 j G i - l\pai ■ 0) + OJ = pr [3j G i-lXpoj : | Gj {6 l * a ' ) \ > w l3 A 

< pr (3j G i - l \pa.i : \G ] {0 l ' pa ')\ > qX and ly^A < qX for some g > 1 

< pr (3j G i — l \paj : Wij < q for some q > 1) 

Since to^ = 1 V ^ with 0* the lasso estimate of the adjacency matrix from (3.8), using Lemma 7.1, 

pr (3j G i — l \paj : Wij < q for some q > 0) = pr (3j G i__J_\pa; : \6j\ > g -1 / 7 for some q > 1 

< pr ^3j G z - l \pa; : > g' for some g' > 

< pr (3j G j - l\paj : 6) ± o) 
= pr G i - l\pa t : |G, (^ pa ' ) | > A 
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Since card (pai) = o(n), we can assume without loss of generality that card (pai) < n, which implies that l,pai is 
an almost sure unique solution to (7. 1) with X — pai. Let 



£={ max \Gj(6 i ' pai )\ < X°\ 



Then conditional on event £, it follows from the first part of Lemma 7. 1 that 9 1 ,pai is also a solution of the unrestricted 
weighted lasso problem (7.1) with 1 = i — 1 . Since 0j' pai = 0, Vj G i — l \pcn, it follows from the second part of 

Lemma 7.1 that #j =0, V? G i — l \paj. Hence 



pr [3j G i - 1W : 0! ) < 1 - pr(£) = pr max |G, (6< 4 < pai ) | > A (7. 1 1) 

V / \jei-l\paj ) 

where, 

Gj{6 i ' pai ) = -2n- 1 X]{X l - ,W' pa ') (7.12) 

Since card (V) = 0(n a ) for some a > 0, Bonferroni's inequality implies that to verify (7.10) it suffices to show that 
there exists a constant era} > such that for all j G i — l \pcn, 

pr (\G J {9 i ^)\ > A ) = O {cxp (-c (i4) 7i c )} as n -> oo. (7.13) 

where Rj ~ Af(0, cr|), cr? < 1 and i?j is independent from G pa^ Similarly, with i?; satisfying the same 

requirements as i?^, we get X l = ^Z k&pa . X k + 

Denote by X pai the columns of X corresponding to pai and let 8 pai the column vector of coefficients with dimen- 
sion card (pai) corresponding to pai. Then, 

pr{|Gj(6^ ai )| > A } = pr j| - 2n~ l X 1 J (X l - X9 l,pai )\ > A°| 

= pr [| - 2n- 1 {X paz d^ + 7^ J } T {A' pa! (^ - 0%?) + 72,} | > A 

Therefore, 

W {\G 3 (6^)\ > A } < pr{| - 2n- 1 {0%? ~ ^C^A^'I > \°fz} 



pr 



{| - 2n-\6 P ^ - e^ fx^TZ,] > A°/3} 
{| - 2n-\X pai e p ^ +TZ J ) 1 TZ l \ > A°/3} 



+ pr 
= I + II + III 

Let l pai denote a vector of l's of dimension card (pai). Then, using the fact that \6j' pai \ < 1, for all I G pai, we can 
writel < pr (2\\0g? - e^^n^iX^l^ X pai l pai > A°/3). Then A£, i X pai ~ W caxd(pai) (S pai ,n) where 
W m (S, n) denotes a Wishart distribution with mean nS. Hence, from properties of the Wishart distribution, we get 

(<^-pai~^-pai) ^-pcii^-pai ^ (lpa i ^pai Ipa^ ; 

Since pa^ also forms a DAG, the eigenvalues S pQi are bounded (see Remark 2.2), and hence 

^-pCLi P a i^-P a i — 

card (pai)4> max (T, pai ) (7.14) 

Therefore, if Z ~ x^, n 1 (^pa^pai) 1 X pai l pai is stochastically smaller than card (pai)0 max (S pa JZ. On the other 
hand, by Theorem 4.2, 

||A-i|| F = O p {(n- 1 S logp) 1/2 } 

and hence 

P P T - II- = P {(n-^logp) 172 } (7.15) 
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Noting that card (pa*) = 0(n b ) 1 b < 1/2 andp = 0(n a ), a > 0, (7.14) and (7.15) imply that 

ll^ 1 -^ l ||ooCard(p a O0max(S pai ) = O p {( S n 2h - 1 alogn) 1/2 } 

By (A-0'), sn 2h - x log n = o(l) and hence by Slutsky's Theorem and properties of the x 2 -distribution, there exists 
cm > such that for all j G i — V \paj, 

I = O {exp (— cn\n ^)} as n — ► oo 
Using a similar argument, for II we can write, 

II < pr (2n" 1 ||^ - 0jg* |U| J^^l > A°/3) . (7.16) 

Since columns of X pai correspond to samples from normal random variables with mean zero and are all independent 
of IZj, it suffices to show that there exists cm) > such that for all j G i — )\ paj and for all k G pa,, 

pr(2n- 1 ||^ i -^lUcard^OlAffe 7 ^! > A°/3) = O {exp (-c (n) n c )} as n ^ oo (7.17) 

By (7.15) and (A-0'), the random variable on the left hand side of (7.17) is stochastically smaller than 2n~ 1 \X}Jlj\. 
By independence of X^ and Rj, E(XkRj) — 0. Also using Gaussianity of both X}. and Rj, there exists g < oo 
such that E{exp (\XkRj\)} < g. Since A = (^{(logp/n) 1 ^ 2 }, by Bernstein's inequality (Van der Vaart and Wellner, 
1996), pr(2n _1 \X\Jtj\ > A°/3) < exp(— c/mn^) for some cm) > and hence (7.17) is satisfied. Finally, for III we 
have 

pr { | - 2n- 1 (X pat 9^ + Kiflljl > A°/3} = pr{| - 2n~ 1 X i 1 llj \ > A°/3} (7.18) 

and using the Bernstein's inequality we conclude that there exists cmn > such that for all j G i — l \paj and for all 
k G pa,i, III = O {exp (— cpmn^)} as n ^ oo. The proof of (ii) is then complete by taking C(,j) to be the minimum 

Of Cm, . . . ,C(ni). 

To prove (iii), note that pr (pa^ G pa.J = 1— pr (3j G pa^ : 0j = O^j, andletf = |max feei _ 1 \ pa . |G :) (0 J,pai )| < Aw^ 

It follows from an argument similar to the proof of (ii) that conditional on £, 9 l,pai is an almost sure unique solution 
of the unrestricted adaptive lasso problem (7. 1) with X = i — Therefore, 

pr (3j G pa, : 6) = o) < pr (lj G pa, : 9) = o) + pr (£ c ) . 

From (i), there exists a c\ > such that pr (^3j G pa^ : 9* = O^j = 0(cxp (— cin^)) and it was shown in (ii) that 

pr (£ c ) = Ojexp (~C2n^)} for some C2 > 0. Thus (iii) follows from Bonferroni's inequality. 

Finally, it is easy to see that since p = 0(n a ), the claim in (iv) also follows from (ii) and (iii) and Bonferroni's 
inequality. □ 

of Theorem 4.5. We first show that if ANi n ANj = 0, then i and j are independent. Since S = AA T and A is lower 
triangular, 

min 

S y - AikAj-fc (7.19) 

fe=i 

We assume without loss of generality that i < j. The argument for the case j > i is similar. Suppose for all 
k = 1, . . . , i that A^ = or Ajk = 0, then by (7.19) i and j are independent. However, by Lemma 2.1, Ajk is the 
influence of kth node on j, and this is zero only if there is no path from k to j. Clearly, if i is an ancestor of j, we 
have £jj ^ 0. On the other hand, if there is no node k G i — 1 such that k influences both i and j, (i.e. fc is a common 
ancestor of i and j) then for all k = 1, . . . , i we have AikAjk = and the claim follows. 
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Using Bonferroni's inequality twice and Lemma 7. 1, we get 

pr(3i e V : AN,. <t ANi) < pmaxpr (3j G i - l \ANj : j e p\) 

< p(i — 1) max pr (j £ pa A 

i£V,j£i-l\ANi 



/I > Atl>y 



< p(i — 1) max pr \ IG, 

teV,j€t-i\Aiy< L 

However, by definition u> ?J > 1, and hence it suffices to show that, 

(i-l)p max pr(|G 7 (^^ Wl )| > x) < a. (7.20) 
ieV,jei-l\ANj V J 

Note that Gj(6 l ' AN ') = -In^X^X, - X8 l ' AN >) and X 3 is independent of X k for all k e AN t . Therefore, 
conditional on Xan, , Gj ~ (0, 4R 2 /n), where i? 2 = ri^ 1 1|^ - X9 l ' AN '\\l < n^WXWl = 1, by definition 

of 0*, AN i anc [ the fact that columns of the data matrix are scaled. 

It follows that for all j e i - l \AN h pr {|Gj(0 i,AJVi )| > A | X AN ^ < 2{1 - $(n 1 / 2 A/2)}, where $ is the 
cumulative distribution function for standard normal random variable. Using the choice of A proposed in (4.1), we get 
pr | \Gj (ff l ' Am ) | > A | X ANi | < -jj^, and the result follows. □ 
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